
Parallelize the TableVectorizer column-wise #592

Merged
62 commits merged into skrub-data:main on Sep 20, 2023

Conversation

@LeoGrin (Contributor) commented Jun 12, 2023

Fix #586

@LeoGrin changed the title from "Table vec parallelism" to "Parallelise TableVectorizer on the columns" on Jun 12, 2023
@LeoGrin (Contributor, author) commented Jun 13, 2023

It works with the latest sklearn version (1.2.2), but not with older ones (I think it would be quite painful to make it work there). Are we OK with supporting only recent sklearn versions? @GaelVaroquaux

@GaelVaroquaux (Member) replied:

> It works with the latest sklearn version (1.2.2), [...] Are we OK with supporting only recent sklearn versions?

+1 for supporting only sklearn 1.2.2 and later (can we try 1.2.1?).

Can someone update our requirements and CI build matrix? @LilianBoulard?

@LeoGrin (Contributor, author) commented Jun 13, 2023

> +1 for supporting only sklearn 1.2.2 and later (can we try 1.2.1?).

Cool! It also works with 1.2.1.
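
For context, a minimal sketch of how a minimum-version requirement like this can be guarded at import time; the check skrub actually ships (if any) may look different:

```python
# Hypothetical guard, not the PR's code: refuse to run on scikit-learn
# versions older than the 1.2.1 floor discussed above.
import sklearn
from packaging.version import Version

if Version(sklearn.__version__) < Version("1.2.1"):
    raise ImportError(
        "column-wise TableVectorizer parallelism requires scikit-learn "
        f">= 1.2.1, found {sklearn.__version__}"
    )
```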

@LeoGrin (Contributor, author) commented Jun 13, 2023

Something I don't really like: the transformers_ attribute looks a bit odd for high-cardinality columns, since each column gets its own entry:

```
[('datetime', DatetimeEncoder(), ['date_of_stop', 'time_of_stop']),
 ('low_card_cat',
  OneHotEncoder(drop='if_binary', handle_unknown='ignore'),
  ['agency',
   'subagency',
   'accident',
   'belts',
    ...
   'race',
   'gender',
   'driver_state',
   'dl_state',
   'arrest_type']),
 ('high_card_cat', GapEncoder(n_components=30), ['seqid']),
 ('high_card_cat', GapEncoder(n_components=30), ['description']),
 ('high_card_cat', GapEncoder(n_components=30), ['location']),
 ('high_card_cat', GapEncoder(n_components=30), ['search_reason_for_stop']),
 ('high_card_cat', GapEncoder(n_components=30), ['make']),
 ('high_card_cat', GapEncoder(n_components=30), ['model']),
 ('high_card_cat', GapEncoder(n_components=30), ['charge']),
 ('high_card_cat', GapEncoder(n_components=30), ['driver_city']),
 ('high_card_cat', GapEncoder(n_components=30), ['geolocation']),
 ('remainder',
  'passthrough',
  ['latitude', 'longitude', 'year', 'contributed_to_accident'])]
```

but I'm not sure we can avoid it with this solution.
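
For illustration, a small hypothetical helper (not part of the PR) that regroups these per-column entries for display, merging rows whose name and transformer repr match:

```python
# Hypothetical display-only helper: two GapEncoder(n_components=30)
# entries share the same repr, so their columns collapse into one row.
from collections import defaultdict

def group_transformers(transformers_):
    grouped = defaultdict(list)
    for name, transformer, columns in transformers_:
        grouped[(name, repr(transformer))].extend(columns)
    return [(name, trans, cols) for (name, trans), cols in grouped.items()]
```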

@LilianBoulard added the "enhancement" (New feature or request) label on Jun 15, 2023
@LeoGrin (Contributor, author) commented Jun 16, 2023

@glemaitre, do you have thoughts on this? Do you think it could be worth implementing the parallelization directly in sklearn?

@glemaitre (Member) commented Jun 16, 2023

I am not convinced that this strategy is beneficial for all types of transformers.

To give a concrete example, I think it would be faster to run a StandardScaler on a matrix of several columns than to parallelize a StandardScaler over each column of the matrix.

So the proposed strategy would make sense if the transformer already has an explicit for loop over columns internally. But in that case, I would expect a single instance to receive the whole matrix rather than one instance per column. And if parallelization is beneficial for such a transformer, I would expect it to parallelize internally.

I might be missing some details about the reason to dispatch one column per transformer here.
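
To make the StandardScaler example concrete, here is a rough sketch (not from the PR) of the two strategies; for a cheap, vectorized transformer, the per-task overhead of the second usually outweighs any gain:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100_000, 20)

# Strategy 1: one fit over the whole matrix. NumPy already vectorizes
# the per-column means and variances, so this is fast.
scaler = StandardScaler().fit(X)

# Strategy 2: one clone per column, dispatched with joblib. Each task
# does very little work, so dispatch and data transfer dominate.
base = StandardScaler()
scalers = Parallel(n_jobs=4)(
    delayed(clone(base).fit)(X[:, [j]]) for j in range(X.shape[1])
)
```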

@LeoGrin (Contributor, author) commented Jun 16, 2023

Thanks for your answer @glemaitre !

> I am not convinced that this strategy is beneficial for all types of transformers.
> To give a concrete example, I think it would be faster to run a StandardScaler on a matrix of several columns than to parallelize a StandardScaler over each column of the matrix.

Agreed! The idea would be to keep a list of transformer classes for which column-wise parallelism is actually useful.

> So the proposed strategy would make sense if the transformer already has an explicit for loop over columns internally. But in that case, I would expect a single instance to receive the whole matrix rather than one instance per column. And if parallelization is beneficial for such a transformer, I would expect it to parallelize internally.
> I might be missing some details about the reason to dispatch one column per transformer here.

But right now the transformers used inside the ColumnTransformer are not internally parallelized, right? My first instinct was indeed to parallelize these transformers, but that would create nested parallelism. To avoid the nesting, we discussed favoring the inner loop and removing the ColumnTransformer's own parallelism, but that was deemed too surprising for the user*. This is how we ended up dispatching one column per transformer (see the discussion in #586), but there may be better solutions.

*Now that I think of it, it is indeed surprising to have to set the TableVectorizer's n_jobs to 1 to disable the inherited ColumnTransformer parallelism, but it might work if implemented directly in sklearn? Maybe we could have an option to either parallelize all transformers internally while running them sequentially, or run all transformers in parallel without parallelizing them internally.
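
To illustrate the dispatch being discussed, a hypothetical sketch (column names invented; this is not the PR's actual code): each high-cardinality column gets its own clone of the encoder, so the ColumnTransformer's existing joblib loop becomes the single layer of parallelism:

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from skrub import GapEncoder

high_card_cols = ["seqid", "description", "location"]  # hypothetical
encoder = GapEncoder(n_components=30)

# One (name, transformer, columns) triple per column, so the outer
# ColumnTransformer loop can fit them concurrently without nesting.
transformers = [
    (f"high_card_{col}", clone(encoder), [col]) for col in high_card_cols
]
ct = ColumnTransformer(transformers, remainder="passthrough", n_jobs=4)
```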

@glemaitre (Member) replied:

Yep, I see the issue regarding nested parallelism and oversubscription. But I don't know whether we should special-case this or instead handle nested parallelism better in joblib (a far harder task).

Nested parallelism should be somewhat manageable, but it requires the user to set the n_jobs of the different transformers properly, and there is no good default that just makes things work smoothly. I don't know whether this is worth more than some internal, hackish workaround.

It also occurs to me that, UX-wise, having multiple instances of the same transformer, one per column, will not be user-friendly when it comes to inspection. We could always rebuild a single instance, but that smells hackish from far away (and then the same parallelism issue arises at transform time).

@LeoGrin (Contributor, author) commented Jun 16, 2023

> It also occurs to me that, UX-wise, having multiple instances of the same transformer, one per column, will not be user-friendly when it comes to inspection. We could always rebuild a single instance, but that smells hackish from far away (and then the same parallelism issue arises at transform time).

Yes, I agree.

> Nested parallelism should be somewhat manageable, but it requires the user to set the n_jobs of the different transformers properly, and there is no good default that just makes things work smoothly.

If the user provides an n_jobs for the TableVectorizer but doesn't touch the default internal transformers, do you think there is a way to set the n_jobs of the different transformers so that everything works nicely? Or is that what you mean by "there is no good default"?

@glemaitre (Member) replied:

> Or is that what you mean by "there is no good default"?

That is what I meant :). joblib already tries to avoid catastrophic over-subscription in the case of nested parallelism, and I don't think we can do better (at least not in an obvious way). But @GaelVaroquaux and @ogrisel know the internals better than I do.
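
For reference, a hedged sketch of the knob joblib already exposes here: the loky backend can cap the threads available inside each worker (via threadpoolctl), which bounds over-subscription when the inner code is itself threaded, e.g. through BLAS or OpenMP:

```python
import numpy as np
from joblib import Parallel, delayed, parallel_backend
from sklearn.decomposition import NMF

X = np.random.rand(200, 50)

with parallel_backend("loky", inner_max_num_threads=2):
    # Four outer workers, each capped at two inner threads, so at most
    # about eight threads run at once regardless of nesting.
    models = Parallel(n_jobs=4)(
        delayed(NMF(n_components=5, init="random", random_state=k).fit)(X)
        for k in range(8)
    )
```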

@LilianBoulard changed the title from "Parallelise TableVectorizer on the columns" to "Parallelize the TableVectorizer column-wise" on Jun 19, 2023
@Vincent-Maladiere (Member) left a review:

Hey @LeoGrin, here are a couple of additional remarks :)

[6 review threads on skrub/_table_vectorizer.py, since resolved]

@Vincent-Maladiere (Member) left a review:

We're almost done; here are some final comments.

[4 review threads on CHANGES.rst and skrub/_table_vectorizer.py, since resolved]
@Vincent-Maladiere (Member) left a review:

Thanks, @LeoGrin, that's a great feature! Let's add the missing docstrings, then LGTM :)

@Vincent-Maladiere (Member) commented:

You have conflicts with the main branch.

@LeoGrin (Contributor, author) commented Sep 1, 2023

Thanks for all the comments @Vincent-Maladiere !

@Vincent-Maladiere (Member) left a review:

Some last docstring comments.

[2 review threads on skrub/_table_vectorizer.py, since resolved]
@jovan-stojanovic (Member) left a review:

Thanks, LGTM!

@LeoGrin merged commit f981247 into skrub-data:main on Sep 20, 2023. 21 checks passed.
Labels: enhancement (New feature or request)

Closes: Better parallelism for TableVectorizer (#586)

6 participants